An Agent-Based Approach to Chinese Word Segmentation

نویسندگان

  • Samuel W. K. Chan
  • Mickey W. C. Chong
چکیده

This paper presents the results of our system that has participated in the word segmentation task in the Fourth SIGHAN Bakeoff. Our system consists of several basic components which include the preprocessing, token identification and the post-processing. An agent-based approach is introduced to identify the weak segmentation points. Our system has participated in two open and five closed tracks in five major corpora. Our results have attained top five in most of the tracks in the bakeoff. In particular, it is ranked first in the open track of the corpus from Academia Sinica, second in the closed track of the corpus from City University of Hong Kong, third in two closed tracks of the corpora from State Language Commission of P.R.C. and Academia Sinica.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Chinese Word Auto-Confirmation Agent

In various Asian languages, including Chinese, there is no space between words in texts. Thus, most Chinese NLP systems must perform word-segmentation (sentence tokenization). However, successful word-segmentation depends on having a sufficiently large lexicon. On the average, about 3% of the words in text are not contained in a lexicon. Therefore, unknown word identification becomes a bottlene...

متن کامل

Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based?

Chinese part-of-speech (POS) tagging assigns one POS tag to each word in a Chinese sentence. However, since words are not demarcated in a Chinese sentence, Chinese POS tagging requires word segmentation as a prerequisite. We could perform Chinese POS tagging strictly after word segmentation (one-at-a-time approach), or perform both word segmentation and POS tagging in a combined, single step si...

متن کامل

Chinese Word Segmentation Based on Contextual Entropy

Chinese is written without word delimiters so word segmentation is generally considered a key step in processing Chinese texts. This paper presents a new statistical approach to segment Chinese sequences into words based on contextual entropy on both sides of a bigram. It is used to capture the dependency with the left and right contexts in which a bigram occurs. Our approach tries to segment b...

متن کامل

Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation

Unknown words and word segmentation granularity are two main problems in Chinese word segmentation for ChineseJapanese Machine Translation (MT). In this paper, we propose an approach of exploiting common Chinese characters shared between Chinese and Japanese in Chinese word segmentation optimization for MT aiming to solve these problems. We augment the system dictionary of a Chinese segmenter b...

متن کامل

SYSTRAN's Chinese Word Segmentation

SYSTRAN’s Chinese word segmentation is one important component of its Chinese-English machine translation system. The Chinese word segmentation module uses a rule-based approach, based on a large dictionary and fine-grained linguistic rules. It works on generalpurpose texts from different Chinesespeaking regions, with comparable performance. SYSTRAN participated in the four open tracks in the F...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008